Multi-Domain Sampling With Applications to 
Structural Inference of Bayesian Networks * 

Qing Zhou^ 



Abstract 

When a posterior distribution has multiple modes, unconditional expectations, such 
as the posterior mean, may not offer informative summaries of the distribution. Mo- 
tivated by this problem, we propose to decompose the sample space of a multimodal 
distribution into domains of attraction of local modes. Domain-based representations 
are defined to summarize the probability masses of and conditional expectations on 
domains of attraction, which are much more informative than the mean and other 
unconditional expectations. A computational method, the multi-domain sampler, is de- 
veloped to construct domain-based representations for an arbitrary multimodal distri- 
bution. The multi-domain sampler is applied to structural learning of protein-signaling 
networks from high-throughput single-cell data, where a signaling network is modeled 
as a causal Bayesian network. Not only does our method provide a detailed landscape 
of the posterior distribution but also improves the accuracy and the predictive power 
of estimated networks. 

Key words: Domain-based representation; Multimodal distribution; Monte Carlo; 
Network structure; Protein-signaling network; Wang-Landau algorithm. 

1 Introduction 

In Bayesian inference the information on an unknown parameter given an observed dataset 

yobs is contained in the posterior distribution p{6 \ yobs)- When a posterior distribution 
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does not belong to a well-characterized family of distributions, Markov chain Monte Carlo 
(MCMC) is a standard computational approach to Bayesian inference via sampling from 
the posterior distribution. Typical examples of MCMC include the Metropolis-Hastings 
(MH) algorithm (Metropolis et al. 1953, Hastings 1970) and the Gibbs sampler (Geman 
and Geman 1984, Gelfand and Smith 1990, Tanner and Wong 1987). Thorough reviews of 
recent developments on Monte Carlo methods and their applications in Bayesian computa- 
tion can be found in Chen et al. (2001) and Liu (2008). The posterior mean E{0 \ yobs) 
and other expectations are usually approximated from a Monte Carlo sample to summarize 
the posterior distribution. However, these unconditional expectations may not offer good 
summaries of the information for Bayesian inference when a posterior distribution has mul- 
tiple local modes. One can easily construct a multimodal posterior distribution of which 
the mean is located in a low-density region and thus using it as an estimator for 6 lacks a 
conventional interpretation. To extract more information contained in a multimodal poste- 
rior distribution, it is desired to identify all major modes and calculate various statistics in 
appropriate neighborhoods of these modes. 

To achieve these tasks, we propose to partition the sample space of 6 into a collection 
of domains such that the posterior distribution restricted to each domain is unimodal. The 
most parsimonious partition that minimizes the number of domains is to use the domains of 
attraction (to be defined rigorously later) of the local modes. Take the trimodal distribution 
p{6) in Figure [T] as an illustration. The space is partitioned into three domains, denoted 
by Gi,02 and ©3: Each domain contains exactly one local mode; if we move any € 
{k = 1,2,3) along the trajectory that always follows the gradient direction of p{0), it will 
eventually reach the mode in the domain 0^. We may then calculate various conditional 
expectations on these domains, E[h{d) \ 6 € Qk] (^ = l5 2,3), for different functions 
h. Together with the probability masses of the domains, P{6 G ©fc)) they provide more 
informative summaries of the distribution p{0) than unconditional expectations. Such a 
summary is called a domain-based representation (DR) for p{6). 

Although desired, construction of DRs for an arbitrary distribution is very challenging 
in practice. Sufficient Monte Carlo samples from domains of all local modes are necessary 
for estimating DRs, but efficient sampling from a multimodal distribution has always been a 
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Figure 1: Contour plot of a two-dimensional density with three modes labeled A, B and C. 
The numbers report log densities of contours and the dashed curves mark the boundaries 
between domains of attraction. 

hard problem. In this article, we develop a computational method that is able to construct 
domain-based representations for an arbitrary multimodal distribution. We partition the 
sample space into domains of attraction and utilize an iterative weighting scheme aiming at 
sampling from each domain with an equal frequency. The weighting scheme was proposed 
by Wang and Landau (2001), and further developed and generalized by Liang (2005), Liang 
et al. (2007) and Atchade and Liu (2010) among others. However, a direct application 
of these existing methods cannot provide accurate estimation of DRs, due to at least two 
reasons. First, sample space partition used in these methods is usually predetermined 
according to a set of selected density levels. But partitioning the space into domains of 
attraction, as employed in our method, cannot be completed beforehand because it is a 
nontrivial job to detect all local modes and their domains in real applications. Second, 
the above methods mostly rely on simple local moves and lack a coherent global move to 
jump between different local modes. To obtain accurate estimation of DRs, we propose 
a dynamic scheme to partition the sample space into domains of attraction and devise a 
global move that utilizes estimated DRs along sampling iterations to enable fast transitions 
between multiple domains. Since the main feature of our method is to sample from multiple 
domains and construct DRs, we call it the multi-domain (MD) sampler. 

The MD sampler can be applied to a wide range of Bayesian inference problems and 
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it is particularly powerful in tackling problems with complicated posterior distributions. 
Although there are many such applications in different fields, this article mainly concerns 
structural learning of causal Bayesian networks from experimental data. Learning network 
structure via Monte Carlo sampling is very challenging as the multimodality of the posterior 
distribution is extremely severe (Friedman and Koller 2003, Ellis and Wong 2008, Liang and 
Zhang 2009). In this problem, a domain of attraction is defined by a set of network struc- 
tures, each represented by a directed acyclic graph (DAG). In a sense, the goal of the MD 
sampler is to construct a detailed landscape of the posterior distribution, which can provide 
new insights into the structural learning problem. Application of our method to a scientific 
problem is illustrated by a study on constructing protein-signaling networks from single-cell 
experimental data. A living cell is highly responsive to its environment due to the exis- 
tence of widespread and diverse signal transduction pathways. These pathways constitute 
a complex signaling network as cross-talks usually exist between them. Knowledge of the 
structure of this network is critical to the understanding of various cellular behaviors and 
human diseases. Recent advances in biotechnology allow the biologist to measure the states 
of a collection of molecules on a cell-by-cell basis. Such large-scale data contain rich infor- 
mation for statistical inference of signaling networks, but powerful computational methods 
are needed given the complexity in the likelihood function and the posterior distribution. 
With the MD sampler, not only can we build a signaling network from the posterior mean 
graph, but also we may discover new pathway connections revealed by different domains of 
the posterior distribution, which are not accessible by other approaches. 

The remaining part of this article is organized as follows. Section [2] defines the domain 
of attraction and domain-based representation. In Section [3] we develop the MD sampler 
and its estimation of DRs, with convergence and ergodicity of the sampler established 
in Appendix. The method is tested in Section U] on an example in Euclidean space and 
implemented in Section [5] for Bayesian inference of network structure with a simulation 
study. Section [6] is the main application to the construction of signaling networks in human 
T cells. The article concludes with a discussion on related and future works. 
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2 Domain-based representation 

Let p(x), X S C M"^, be the density of the target distribution. Suppose that p(x) is 
differentiable and denote by Vp(x) the gradient of p at x. Define a differential equation 

^ = Vp(x(i)) (1) 

and write a solution path of this equation as x(t),i G [0, oo), where x(0) is a chosen initial 
point. Under some mild regularity conditions, x(oo) converges to a local mode of p(x), which 
is the basic intuition behind the gradient descent algorithm to maximize j'(x). Denote by 
{ui, . . . , Vk} all the local modes, including the global mode, of p(x). For x € let x(0) = x 
and define the domain partition index by 



/(x) 



k, if x(oo) = Vk, for k = 1, . . . , K 
0, otherwise. 



It maps X to the index of the local mode to which the solution path starting at x converges. 

Definition 1. The domain of attraction of Vj^ is D/^ = {x G ^ : /(x) = k} for k = 1, . . . ,K. 
For simplicity we may call D/^ an attraction domain or a domain of p. 

If the stationary points of p(x) have zero probability mass, then {D^ : k = 1, . . . , K} 
form a partition of the sample space X except for a set of zero probability mass. This is 
the default setting for this article. 

Let /i(x) be a p-integrable function of x. Write the probability mass of and the 
conditional expectation of /i(X) given X € as 

Afc = P(X e Dk) = [ p(x)dx, (3) 
^ih,k = ^[/i(X) \ XeDk] = ^ [ /i(x)p(x)dx, (4) 

^fc JDk 

respectively, for k = 1, . . . , K . 

Definition 2. The domain-based representation of h with respect to the distribution p is 
sl2x K array, DRp{h) = {{nh,k, Afc) : k = l,... , K}. 



The DR is equivalent to the probabihty mass function of E[h(X.) \ /(X)] that assigns 
probabihty Afc to fih,k ior k = I, . . . , K . It provides the expectation E[h{X.)] = )^kfJ'h,k 
and the decomposed contributions from the attraction domains of p(x). Such a represen- 
tation gives an informative low-dimensional summary of a multimodal distribution. For 
a complex distribution with many local modes, however, we cannot afford to estimate 
{fJ'h,k, Afc) for every domain when K is too large and are less interested in domains of negli- 
gible probability masses (Afc very close to zero). Due to these reasons we define domain-based 
representations with respect to a set of local modes {vk ■ k = 1, . . . , M}. Index all the local 
modes as Vi, . . . , vm, ■ ■ ■ , i^k < K). Define the domain partition index with respect to 
{^fclfcli by Ih[{y^) = /(x) if 1 < /(x) < M and Im{^) = otherwise. Then the sample 
space X can be partitioned into = {x € : /j\/(x) = k} for k = 0, . . . ,M, where 
-Dfc is the domain of i^k {k > 1) and Dq = X — Uff^]^ Dfc. The DR of h with respect to 
{^fclfcli is defined by the array {{fih,k, ^k) : k = 0,. . . ,M}, where Aq and fihfl are defined 
for Dq similarly as in Equations ([3]) and ([4]), respectively. Note that one can still obtain 
E[h{X.)] = ^flo ^ktJ'h,k after merging Dm+i, ■■■,Dk into Do- 
There is a geometric interpretation for the attraction domains of a posterior distribu- 
tion. Suppose that Y = (Yi, . . . , Y„) is a sample from an unknown distribution V'(y)- We 
assume a parametric family fg = /(• | 6), 6 £ Q, as the model for Y and put a prior vr(0) 
on the unknown parameter 6. The posterior distribution of 6 is given by 



when the sample size n is large, where £^[log/(Yi | 0)] = Jlog[/(y | 6)]'tlj{y)dy. Denote 
the Kullback-Leibler (KL) divergence between tp and fg by 



Then, p{6 | Y) cx expl—ndKLiipWfe)] when n is large. In the space of density functions, 
we can regard dKLi'tpWfe) as the "distance" to the point from a point in the manifold 
^A = {fg : 6 £ Q}. Note that tp is not necessarily in A4 if our model assumption on Y is 



n 



p{e\Y)^7r{e)llf{Y,\e) 



gnB[log/(Yil0)] 



i=l 
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incorrect. Then p{6 | Y) may be interpreted as a Boltzmann distribution on the manifold 
A4 under a potential field ndKLi'4'\\fe)- This potential pushes every fg, indexed by 6, 
towards tp, and the collection of 6 which will be driven to an identical stationary point in 
A4 forms an attraction domain of p{0 \ Y). 

3 The multi-domain sampler 

To develop an algorithm that is able to construct domain-based representations with respect 
to the target distribution p, it is necessary to identify the attraction domain of any x X. 
When p(x) is differentiable, this can be achieved by application of the gradient descent 
(GD) algorithm that finds local modes of p(x), or logp(x) for computational convenience. 
Generalization to the space of network structures will be discussed later. 

A naive two-step approach to the construction of DRs is quite obvious. We may first 
apply a Monte Carlo algorithm to simulate a sample {X*}"^]^ from p(x) or from a diffuse 
version of p(x), e.g., [p(x)]^/'^ for r > 1 as used in parallel tempering (Geyer 1991). Then, 
for every t we determine /(X*) by a GD search initiated at X* to find to which domain 
it belongs. This approach partitions the sample into attraction domains so that we can 
estimate the probability masses and conditional expectations for all identified domains. 
Although simple to implement, this two-step approach has a few drawbacks in terms of 
efficiency. Without a careful and specific design the Monte Carlo algorithm, even targeting 
at a diffuse version of p(x), may not generate enough samples from all major domains or 
may completely miss some modes. As a result, estimation on some attraction domains may 
be inaccurate or unavailable. In addition, this approach does not utilize the information 
on the target distribution provided by the constructed DRs. To overcome these drawbacks, 
we develop the MD sampler that may achieve simultaneously an efficient simulation from 
a multimodal distribution and an accurate construction of domain-based representations, 
with comparable computational complexity as the naive two-step approach. 
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3.1 The main algorithm 

We wish to sample sufficiently from the majority of attraction domains. However, the 
density at the boundary between two neighboring domains is often exponentially low (e.g., 
Figured]), which makes it difficult for an MH algorithm or a Gibbs sampler to jump across 
multiple domains. Thus, we need to allow the sampler to generate enough samples from 
such low-density regions that connect different domains. These considerations motivate the 
following double-partitioning design in the MD sampler. 

Suppose that we are given a set of local modes of p(x), {ui , • • • , i^m}, which may include 
the global mode. Given oo = Hq > Hi > . . . > Hl = — oo, define the density partition 
index J(x) = j if logp(x) G [Hj,Hj-i) for j = 1, . . . ,L. We partition the space X into 
(M + 1) X L subregions, 

Dkj = {x G ^ : /a/(x) = k, J(x) =j},k = 0,...,M,j = l,...,L, (5) 

where /Af(x) is the domain partition index with respect to {i^k}^i- Then, the attraction 
domain of the local mode Uk is = [Jj Dkj {1 < k < M). Note that some D^j may be 
empty; if logp{vk) < Hi then all Df^j for j < i are empty. In what follows, we only consider 
nonempty subregions. For a given matrix W = (tt'fcj)(j\/+i)xL) define a working density 

r Asn ?'(x)l(x G Dfcj) 

p(x;W)oc^^ , (6) 

k=Oj=l 

where l(-) is an indicator function. Let W* = {w^j) such that exp(w|,^) = Jj^^ p{x.)dx. 
Then, the probability masses of D/^j are identical under j>(x; W*). Sampling from p(x; W*) 
has two immediate implications. First, the sample sizes on the attraction domains, {Dj^lfcLi) 
will be comparable, and thus, domain-based representations can be constructed with a high 
accuracy. Note that commonly used MCMC strategies for multimodal distributions, such 
as tempering, cannot generate samples of comparable sizes from different domains. Second, 
the sampler will stay in low-density regions (e.g., -D/cl) for a substantial fraction of time, 
which makes it practically possible to jump between domains. Conversely, domain-based 
representations may be utilized to design efficient local and global moves for sampling from 
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p(x;W*). We may construct online estimate of the covariance matrix on the domain of 
a local mode, which can be used for tuning the step size of a local move in this domain. 
For a multimodal distribution, tuning step size for each domain is more useful than tuning 
the overall step size (Harrio et al. 2001). Once we have identified sufficient local modes 
and estimated covariances of their respective domains, we can use them to design global 
moves that may jump from one domain to another. As one can see, these proposals can be 
implemented only if we have partitioned samples into attraction domains. 

For X, y € X, let g(x, y) be a proposal from x to y and r(x;0,V) a distribution 
with parameters and V G V. We first develop the main algorithm of the MD sampler, 
assuming that the density ladder {Hj} is fixed and M local modes {i^k}kLi have been 
identified. Dynamic update of these parameters will be discussed in Section 13.21 as the 
burn-in algorithm. Let gfe(x) be a map from X to V for k = 1, . . . , M. 

Algorithm 1 (The main algorithm). Initialize = {wj^j), parameters G V (A: = 
1,. . . ,M), € [0, 1), and e X. Set 71 < 1. For t = 1, . . . , n: 

1. Draw Y from q{'X.^,y) with probability (1 — Pmx) or from the mixture distribution 

Efcii '^(y; '^fc, V^) with probability pmx- 

2. Determine the density partition index J(Y) and perform a GD search initiated at Y 
to determine the domain partition index /jvf (Y). 

3. Accept or reject Y according to the MH ratio targeting at p(x; W*) to obtain X*~^^. 

4. For k = 0, . . . , M and j = 1, . . . ,L update 

for /c = 1, . . . , M update 

Vt+^ = + I [gfc(X*+i) - V*] 1(/a/(X*+1) = fe), (8) 
and determine 74+1. 
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We may regard wl.j as a weight for the subregion D^j. After each visit to D^j, the 
weight w^kj^ increases by 74 unit ([7]), which decreases the probabihty mass of D^j under the 
working density p{x; W*''"^) Such a weighting scheme aims to visit every Df^j uniformly. 
There are two typical choices of {7*}. The first choice follows the standard design in 
stochastic approximation which employs a predetermined sequence such that X^^^ 7^ = 00 
and J2t^i^t < C!0 for C G (1,2) (Andrieu et al. 2005, Andrieu and Moulines 2006, Liang 
et al. 2007). The second design, originally proposed by Wang and Landau (2001), adjusts 
7t in an adaptive way. However, there is difficulty in establishing its convergence (Atchade 
and Liu 2010), and thus we adopt a modified Wang-Landau (MWL) design to update {74} 
in the MD sampler. Initialize cj,^ = for all k and j in Algorithm [TJ The MWL update at 
iteration t (t = 1, . . . , n) is given below. 

Routine 1 (MWL update). If 74 < e^, set jt+i = 7t(7t + 1)"^; otherwise: 

• Set = clj + l(X*+i G Dkj) for all k,j and calculate Ac^+J;^ = max^j - c*+^|, 
where c*"*"^ is the average of all c^^^- 

• If Ac^a^^ > r?c*+\ set 74+1 = 74; otherwise, set 74+1 = p7t and 4"}^ = for all k,j. 

That is, if 7^ > we decrease jt {p < 1) only when the sampler has visited every 
subregion D^j with a roughly equal frequency since our last modification of 7^. Let tc = 
min{t : 7^ < e^}. For t > tc the update becomes deterministic with 7^ = l/(t + ^), where 
^ = "f^^ — tc- The default setting for all the results in this article is given by p = 0.5, 
r] = 0.25 and = lO"'^. 

Under some regularity conditions and the MWL update of {7*}, 

exp('wL)-^ / p(x)dx = exp(i(jL), (9) 



after being normalized to sum up to one. 



„o fn Ki-(x)p(x; W*)dx A 
«(x;W*)(ix 



1 " r 

_V/i(X*)i^/ /i(x)p(x;W*)(ix, (11) 
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as n — > oo. See Theorem [2] in Appendix for more details. If A' C R™-, we often choose 
gfc(x) = (x — i>'fc)(x — i/fc)"*" so that is close to the covariance matrix of the conditional 
distribution [X | X € D^], where X ~ p(x; W*). We use the mode t'^ instead of the mean 
because the mode can be obtained accurately via a GD algorithm. The use of 7f/2 (< 1) in 
Equation ([8]) ensures that V^"^^ is positive definite if is positive definite. 

There are two types of proposals in step 1 of the algorithm, a local proposal g(X*,y) 
and a mixture distribution proposal. One advantage of partitioning samples into attraction 
domains is embodied in the mixture distribution proposal, in which we randomly draw a 
domain partition index k € {1,...,M} and then propose a sample Y from r(y; V^). 
Equal mixture proportions (1/Af) are used because a uniform sampling across domains is 
preferred. The default choice of the distribution ^(y; V^.) in is A/'(^'fc, V^) for k = 
1, . . . , M, which gives a mixture normal proposal that matches the mode and the covariance 
on each domain of the working target p(x; W*). This proposal uses a mixture distribution 
to approximate the multimodal target. It can generate efficient global jumps from one 
domain to another if p(x; W*) on the domain Dj. can be well approximated by r(x; V^) 
with the identified mode and the estimated V^. For simplicity we call this proposal the 
mixed jump. The typical design for gr(X*,y) in is to proposal Y ~ A/'(X*, cr^I), where 
cr^ is a scalar and I is the identity matrix. However, when the covariances are very different 
between domains, using a single local proposal may cause high autocorrelation, since the 
step size might be either too big for domains with small covariances or too small for those 
with large covariances, or both. In this case, we may incorporate an adaptive local proposal, 
Y 1^ A/'(X*, cj^V^^^^-j^t^), in addition to g'(X*,y), such that the learned covariance structure 
of a domain is utilized to guide the local proposal. This shows another advantage of the 
domain-partitioning design. 

Remark 1. We summarize the unique features of the main algorithm. First, domain parti- 
tioning is incorporated in the framework of the Wang-Landau (WL) algorithm. This allows 
a more uniform sampling from different domains, which facilitates construction of DRs. At 
each iteration, a GD search is employed to determine Ij\/(Y) and thus the computational 
complexity of this algorithm is comparable to the naive two-step approach. Second, an 
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adaptive global move, the mixed jump, is proposed given DRs constructed along the iter- 
ation, which utilizes identified modes and learned covariances to achieve between-domain 
moves. 

Remark 2. Verification of the regularity conditions for convergence of the algorithm (see 
Appendix) is recommended before application. Furthermore, we suggest a few convergence 
diagnostics that can be conveniently used in practice. First, 7„ should be small enough at 
the final iteration and the frequency of visiting different should be roughly identical. 
Second, W" and should have converged with an acceptable accuracy. Violations of 
these two criteria indicate that more iterations may be necessary. Third, the adaptive 
parameters used in the mixed jump (V^) should always stay in a reasonable range. For 
example, if is a covariance matrix, one may check whether its eigenvalues are close to 
zero or unreasonably large, which may indicate divergence of the current run. If the last 
criterion is not satisfied, it is suggested to reinitialize the MD sampler with a smaller 71. 

3.2 The burn-in algorithm 

In practical applications of the MD sampler, the density ladder {Hj} and the local modes 
{vk} are updated dynamically in a burn-in period before the main algorithm (Algorithm [1]) . 
The dynamic updating schemes are crucial steps for constructing domain-based representa- 
tions in real applications, as one cannot partition the sample space into domains of attraction 
beforehand. We set Hi, . . . , Hl-i as an evenly spaced sequence so that AH = Hj — Hj^i is a 
constant for j = 1, . . . , L — 2. Let {Hj} be the density ladder and A* = {uj^ : k = I, . . . , M*} 
be the set of M* identified modes at iteration t. Let K* be the maximum number of modes 
to be recorded and denote by i/^^) the mode of the domain that x belongs to. Let be the 
zero matrix with dimension determined by the context. The following routine is used to 
update A* when a new sample Y is proposed. 

Routine 2. Let s* = argmin p{i^l)- 

l<fc<M* 

• If i^(Y) A* and M* < K*, set A*+i = A* U {i^(y)}, M^^^ = M* + 1, and initialize 
w^t+ij = for all j; 
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• if i/(Y) i A*, M* = K* and p(i/(y)) > p(i^*t), set = i/(Y), i^^"^^ = for k ^ s*, 

M*^-'^ = M*, and assign Wg^- <^= + w**^- and w't^. <^= for all j. 

• otherwise set A*+^ = A* and M*+^ = M*. 

According to this routine, we record at most the K* highest modes identified during 
the burn-in period. If there are more than K* modes, Algorithm [T] will construct DRs with 
respect to the recorded modes. The weights are updated when a new mode replaces 

an old one in A*, for which the assignment operator '<^=' is used to distinguish from equality. 

The density ladder {-f^j} is adjusted such that the lower bound of the highest 
density partition interval, is close to logn*, where u* is the density of the highest mode 
identified so far. If log u* > li\ + A// we move upwards the density ladder by Ai? unit and 
update the weights (vj\,^ accordingly, with details provided in Routine[3l This strategy helps 
the sampler to explore the high-density part, which is important for statistical estimation 
and finding the global mode. 

Routine 3. Given A*+\ find = max{p(i^*+^) : A; = 1, . . . ,M*+i}. 

• If log >H\ + Ai?, set Hf^ = + Ai? for j = 1, . . . , L - 1; for A; = 0, . . . , M*+\ 
assign w\j^ <= + w*^, w^j <= w^^O-i) for i = L - 1, . . . , 2, and wl-^ <= 0; 

• otherwise set = {Hj}. 

Algorithm 2 (The burn-in algorithm). Input L,AH and K* . Set 70 = 1. Initialize 
Xi G ;f, Ai = {iv(xi)}, Ml = 1, = K,)2xL = 0, and Vj. Set Hi = logp(i.(xi)) and 
Hj = H]^^ - A/7 for j = 2, . . . , L - 1. For t = 1, . . . , 5: 

1. Draw Y ~ (7(X*,y) and find ^'(y) by a GD search. 

2. Given update A*^^ and M*"*""*^ by Routine E] if t^^^^ = is a new mode in 
A*+^ initialize Y\. Given A*+\ update {H*^^} by Routine El 

3. Given A'^^ and {-ff*^^}, accept or reject Y with the MH ratio targeting at p(x; W*) 
to obtain X*+^ 

4. Execute step 4 of Algorithm [T] with 7^ = 79. 
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Remark 3. Note that 7t = 1 for every iteration in the burn-in algorithm. This pushes the 
sampler to explore different regions in the sample space so that more local modes can be 



iteration, which is the reason for our updating schemes on {u^^^} in Routine [2] when a mode 
is updated in A*+^ and in Routine [3] when the density ladder changes. 

Remark 4. The burn-in algorithm can be used as an optimization method that searches 
for up to K* local modes of the highest densities. As demonstrated in the Bayesian network 
applications, this algorithm is very powerful in finding global modes. 

The MD sampler requires only a few input parameters, L, AH,pmx: and K* . A practical 
rule is to choose L and AH such that the range of the density partition intervals, LAH in log 
scale, is wide enough to cover important regions. In this paper, we set LAH around 20 for 
the low-dimensional test example in Section |4] and around 200 for learning Bayesian networks 
in Sections [5] and m By default the probability of proposing a mixed jump = 0.1. The 
effect of keeping only K* modes will be studied later with the examples. 

3.3 Statistical estimation 

The domain-based representation of h is constructed by estimating ([3]) and fj,h,k © for 
A; = 0, . . . , M with post burn-in samples, denoted by {X^+^jJL^. Let k* = /j\,/(X*+-'^), j* = 
J(X*+i), at = Efc,jexp('Wfcj), and exp(u}|,^.) = exp(u;|,^.)/«t such that Efcj exp(u;^^.) = 1. 
The key identity for our estimation is 



cally Q. See Liang (2009) and Atchade and Liu (2010) for similar results. However, 
W* = {wl,j) may be far from W* even for post burn-in iterations. Thus, it is desired to 
use a weighted version of (jl2p so that X*^"^ will carry a higher weight if W* is closer to 
W*. Since decrease in indicates convergence of the MD sampler and at+i/at -^^e'^'^* for 
c G (0, 1) (supplementary document), a reasonable choice is to weight X*+^ by at so that 



identified. In this case, the weight im! . records the number of visits to D^j before the tth 




(12) 




) /p(X*+i ; W*) asymptoti- 
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unnormalized (w^^) will be used in ([T2]). Consequently, DRp{h) is constructed with 

Er=il(X*+ieDfc)exp(Tz;*,..) 



A 



k 



f''h,k 



Er=i^(X*+^)l(X*+^ e Dfc)expK,.,) 



for /c = 0, ...,M. Then, /i^ = i?[/i(X)] is estimated by fth = Ylk^kp'h,k- Please see 
supplementary document for more discussion on this weighted estimation. 

In the next three sections we demonstrate the effectiveness of the MD sampler in sta- 
tistical estimation, especially estimation of DRs, compared to the naive two-step approach. 
For all examples, we employ the WL algorithm with the MWL update (Routined]) as the 
Monte Carlo method in the two-step approach. To minimize hidden artifacts in a compar- 
ison due to coding differences, we implement the WL algorithm with the same burn- in and 
main algorithms of the MD sampler. In the main algorithm (Algorithm [T|) we replace the 
updating scheme in Equation ([7]) with 



4,+7tl(J(X*+i)=i) (13) 



for k = 0, . . . , M and j = 1, . . . , L and modify the burn-in algorithm accordingly, so that 



Wqj = ■ ■ ■ = w\,j- = Wj for every iteration. Consequently, the working density is effectively 

p(x;WVE "'"'^'-''1°^' (14) 

^ exp(u;;) 

as used in the WL algorithm, which is a diffuse version of p(x) such that each density 
partition interval will be equally sampled after convergence. Note that the same CD search 
is applied at each iteration to partition samples into attraction domains for estimating DRs. 
Our comparison aims to highlight the effect of domain partitioning and the mixed jump in 
the MD sampler which are the key differences from the WL algorithm. 
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4 A test example 

We test the MD sampler with an example in M"*. For this example, domain-based repre- 
sentations can be obtained via one-dimensional numerical integration with a high accuracy, 
which provides the basis to evaluate our estimation. We choose K* = 100, which is greater 
than the total number of local modes, to construct complete DRs. 

Let X = (xi, . . . , Xm)- The Rastrigin function (Gordon and Whitley 1993) is defined as 



where ^4 is a positive constant. We set A = 2 and m = 4 in (jl5p to obtain our target 
distribution p(x) oc exp[— ii(x)], which has 3^ = 81 local modes formed by all the elements of 
the product set {—1.805, 0, 1.805}^. These local modes have five distinct log density values, 
0, —3.62, —7.24, —10.87, and —14.49, dependent on the combinations of their coordinates. 
They are grouped accordingly into five layers so that the number of zeros and the number 
of ±1.805 in the coordinates of a local mode at the A;th layer are (5 — k) and {k — 1), 
respectively, for k = 1,...,5. The attraction domains of local modes at the same layer 
have identical probability masses and identical conditional means up to a permutation and 
change of signs of the coordinates. 

We applied the MD sampler 100 times independently, each run with L = 10 density 
partition intervals, AH = 2, B = 50K burn-in iterations and a total of 5 million (M) 
iterations (including the burn-in iterations). The local proposal was simply 7V(X*,I). The 
average acceptance rate was 0.26 for the local move and was 0.56 for the mixed jump. Let 



X = (Xi, . . . , X4) and S = J2^ Xi- We estimated ^(X), ^(e^^), ^(Jlj ^i), -^(Ei ^i)^ and 



E{^- Xf), all via domain-based representations. Since the target density of this example is 
a product of one-dimensional marginal densities, the above expectations can be calculated 
accurately through one-dimensional numerical integration. We compared our estimates 
from MD sampling with the results from numerical integration by computing mean squared 
errors (MSEs). We report the average MSE of the estimated log probability masses (log Afc) 
and the average MSE of the estimated conditional means (/^x,fc) over all the local modes 



m 



m 




(15) 



1=1 



i=l 
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Table 1: MSE comparison on the Rastrigin function 
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log 
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0.59 


3.25 3.09 


log 


A3 


3.5e-3 


4.64 


3.22 
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2.8e-4 


6.16 


3.34 




1.6e-9 


2.84 2.70 


log 


■ A4 


3.3e-3 


8.63 


3.92 


MX,4 


2.9e-4 


12.0 


4.09 




6.1e-3 


2.71 2.06 


log 


■As 


3.2e-3 


16.8 


4.66 


AtX,5 


3.3e-4 


21.1 


5.06 




0.11 


1.95 2.26 



at the same layer (k = 1, . . . , 5), and for other functions we only report the MSEs of the 
estimated expectations to save space (Table [T]). 

As a comparison, we also applied the WL algorithm (as in the naive two-step approach) 
to this problem with the same parameter setting. The ratio (RMSE) of the MSE of the 
WL algorithm over that of the MD sampler for each estimate is given in Table [H The 
WL algorithm showed larger MSEs than the MD sampler for almost all the estimates, 
especially for those on domains at layers 3, 4 and 5. For example, the MD sampler was at 
least 16 times more efficient than the WL algorithm for estimating logAs and /^x,5- The 
WL algorithm did not simulate sufficient samples from these domains, although it visited 
uniformly different density partition intervals. On the contrary, the double-partitioning 
design facilitated the MD sampler to explore every domain in a uniform manner, which 
led to a substantial improvement in estimation for these layers. This shows the critical 
role of domain partitioning in estimating DRs. To study the effect of the mixed jump, we 
re-applied the MD sampler with pmx = 0, and calculated the ratio of the resulting MSE 
(MDo in Table [1]) over that of the MD sampler with = 0.1, the default setting. One 
sees an increase of two folds or more in MSEs without the mixed jump. The convergence 
of the MD sampler without the mixed jump became slower, reflected by a five-fold increase 
in 7n after the same number of iterations, averaging over 100 independent runs. These 
observations demonstrate that the mixed jump served as an efficient global move which 
accelerated convergence of the MD sampler and improved estimation accuracy. 
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5 Learning Bayesian networks 

A Bayesian network (BN) factorizes the joint distribution of m variables Z = {Zi, . . . , Zm} 
into 

m 

P{Z) = J{P{Zi\Tlf), (16) 

i=l 

where Hp C Z is the parent set of Z^. A graph G is constructed to code the structure of 
a BN by connecting each variable (node) to its child variables via directed edges. For (I16p 
to be a well-defined joint distribution, the graph G must be a DAG. We consider the use 
of Bayesian networks in causal inference (Spirtes, Glymour and Scheines 1993, Pearl 2000), 
which is tightly connected to many areas in statistics, such as structural equations, potential 
outcomes, and randomization (Holland 1988, Neyman 1990, Rubin 1978, Robins 1986). Here 
we follow Pearl's formulation of causal networks by modeling experimental intervention. If 
Zj is a parent of Zi in a causal Bayesian network, then experimental interventions that 
change the value of Zj may affect the distribution of Zi, but not conversely. Once all 
the parents of Zi are fixed by intervention, the distribution of Zi will not be affected by 
interventions on any variables in the set Z \ (Hp U {Zi}). In the example causal network of 
Figure [2l if we fix Zi and Z^ by experimental intervention, then the distribution of Z4 will 
not be affected by perturbations on Z2, Z^, or Zq. 




Figure 2: An example Bayesian network of six variables. 
5.1 Posterior distribution 

We focus on the discrete case where each Zi takes states indexed by 1, . . . ,ri and the 
parents of Zi take qi = YlzjeWr^ joint states. Let Oij^ be the causal probability for 
Zi = j given the kth joint state of its parent set. A causal BN with a given structure G is 
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parameterized by = {6ijk : J2j ^ijk = 'i-,Oijk > 0}. 

We infer network structure from two types of data jointly, experimental data and 
observational data. For experimental data, a subset of variables are known to be fixed by 
intervention. Inferring causality with intervention has been extensively studied in various 
contexts (e.g., Robins 1986, 1987, Pearl 1993). We adopt Cooper and Yoo (1999) for 
calculating the posterior probability of a network structure given a mix of experimental 
and observational data. Suppose that Nij^ is the number of data points for which Zi is 
not fixed by intervention and is found in state j with its parent set in joint state k. Then, 
the collection of counts N = {Nijf.} is the sufficient statistic for (Ellis and Wong 2008). 
Let |nf I be the size of the parent set of Zi. The prior distribution over network structures 
is specified as vr(G) oc /3^j I^?', (3 € (0, 1), which penalizes graphs with a large number of 
edges. With a product-Dirichlet prior for 0, the posterior distribution [G \ N] (Cooper and 
Herskovits 1992) is 

ml Qi 

p{G I N) oc n /3i"?i n 

i=l [ k=l 

where aijk = a/{riqi) is the pseudo count for the causal probability 6ijk in the product- 
Dirichlet prior and iVj.fc = Nij^ (similarly for aj.fc). The hyperparameters in the prior 
distributions are chosen as /3 = 0.1 and a = 1. 

5.2 MD sampling over DAGs 

The space of DAGs is discrete in nature. We define domains of attraction for P{G \ N) 
(|17|) with a move set composed of addition, deletion and reversal of an edge. Given a DAG 
Ga, we say that another DAG Gh is a neighbor of Ga if G^ can be obtained via a single 
move starting from Ga, i.e., by adding, deleting or reversing an edge of Ga- Denote by 
ngh{Ga) all the neighbors of Ga and let ngh(Ga) = ngh{Ga) U {Ga}- A DAG G* is defined 
as a local mode of a probability density (mass) function p{G) if p{G*) > p{G') for every 



r(ai., 



T r(Qijfc + N. 



T{ai.k + Ni.k) n T{aijk) 



ijk ) 



(17) 
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G' G ngh{G*). Let G° be a DAG and define recursively 



= argmax for t = 0, 1, . . .. (18) 

Tliat is, we recursively find the single move that leads to the greatest increase in p until a 
local mode is reached, which can be viewed as a discrete counterpart of the gradient descent 
algorithm. If there are more than one maximum in (llSp with an identical function value, we 
take the first maximum according to a fixed ordering of the neighbors. We call this recursion 
the steepest neighbor ascent (SNA). Based on SNA, we define the domain partition index 
I{G) and the attraction domains of p in the same manner as for a differentiable density 
(Equation [2] and Definition [T]) . 

The target distribution in this application is the posterior distribution P{G \ N) and 
we define working density P{G \ N; W) similarly as in ([6]). To implement the MD sampler 
for DAGs, we employ the move set as the local proposal and develop the following mixed 
jump. For a DAG G, we define an edge variable Ef^ for every pair of nodes Zi and Zj 
{i < j) such that = 1 if is a parent of Zj, Ef^ = — 1 if Zj is a parent of Zi, and 
Ef. = otherwise. Given a DAG v, let C(G; v) = {Ca{G; v), Gd{G; v),Cr{G; v)) be a map 
of G, where 

Ca{G;u) = X:,<,l(i^,^/0,£;,^.=0), 

Cd{G-u) = E^<jHE^j=0,E^^^O), 

Cr{G;u) = Z^<,HE^■E'^, = -^■ 
In words, C(G;z^) gives the numbers of additions, deletions and reversals needed to obtain 
G from I'. Let T = m{m — l)/2 be the total number of node pairs and \Eiy\ be the 
number of edges in v. Then, the number of common edges between G and i' is \E^\ — 
[Cr(G;z^) + CdiG-jv)] and the number of node pairs with no edge in either DAG is T — 
\E^\ - Ca{G; v). Given a set of local modes of P{G \ N), {uk}kLi. let gfc(G) = C(G; Vk) in 
the update of ([8]) of Algorithm [U where V^, = {v\ g^-,v\ ^,v\, ^) is a vector. As ?i — )• oo, 
yn E[C{G; i^fe) I G € Dfc], where the expectation is taken with respect to the limiting 
working density P[G \ N;W*) (llOp . In the mixed jump, after a local mode is randomly 
chosen, we sequentially modify the edge variables of Vk to propose a new DAG Y . The 
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proposal is designed according to V|,, the current estimate of the expected numbers of 
additions, deletions and reversals of DAGs in the domain relative to the mode Vk- Let 
lE'fcl be the number of edges in v^. If E^^ ^ 0, we propose to reverse, delete, or retain 
the edge E^^ (i.e., EJ- = —E^j,0,otE^j) with probabilities proportional to the vector 
{vl J., I'fc rf, l-Efc I — {vl. J, + vj^ rf)) + ft, where 6 > is a small prior count added to each category. 
Analogously, if E'^j = we propose Efj = 0,1, or — 1 with probabilities proportional to 
(T — \Ek\ — vj^ ^,vl, ^/2, vl, Ji) + h. Lastly, to ensure a proposed graph is acyclic, a check for 
cycles is performed when we propose to add or reverse an edge in either the local proposal 
or the mixed jump. If the resulting graph is cyclic, we suppress the probability for the 
corresponding move. 

Following the common practice in structural learning of discrete BNs, we set an upper 
bound for the number of parents (indegree) of a node. In all the following examples and 
applications, this upper bound is chosen to be four. We are interested in the posterior 
expected adjacency matrix A = {aij)mxm and its domain-based representation, where aij 
(1 ^ hj ^ n-) is the posterior probability for a directed edge from Zj to Zj. For each 
identified local mode z^^, we estimate the probability mass of its attraction domain 
and the conditional expected adjacency matrix A^ on the domain. Then, A is estimated 
by A = Y^k ^^k^k- 

5.3 Simulation 

We simulated data from two BNs, each of six binary variables (m = 6, = 2,Vi). This is 
the maximum number of nodes for which we can enumerate all DAGs, numbering about 
four million, to obtain true posterior distributions and domain-based representations as the 
ground truth for testing a computational method. The first network has a chain structure 
in which Zi is the only parent of Zj+i for i = 1, . . . , 5 and Zi has no parent. The second 
network has a more complex structure shown in Figure [2j We simulated 50 datasets inde- 
pendently from each network. In each dataset, 20% of the data points were generated with 
interventions. Please see supplementary document for data simulation details. 

The MD sampler Wcis cipplied to th.6 100 datasGts with. L — 15, — 10, Pmx — 

0.1, 

K* = 100 and a total of 5M iterations with the first 50K as burn-in iterations. To verify its 
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Table 2: Comparison on simulated data from two BNs 
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2.96 


13.6 


1.01 
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1.5e-4 


7.17 


5.09 


0.98 



The top and bottom panels report the results for the chain and the graph networks, re- 
spectively. For each estimate, reported are the MSE of the MD sampler and the RMSEs 
(ratios) of the other methods relative to the MD sampler. 

performance, we compared identified local modes, estimated probability masses {logAfc}, 
conditional expected adjacency matrices {A^}, and expected adjacency matrix A to their 
respective true values obtained via enumerating all DAGs. Our enumeration confirms that 
the posterior distributions indeed have multiple local modes. The chain and the graph 
(Figure [2|) networks have on average 3.57 and 7.06 modes over the simulated datasets, 
respectively, and the maximum number of modes is 29 for the chain and 34 for the graph. 
As reported in Table [21 the MD sampler did not miss a single local mode for any dataset, 
which demonstrates its global search ability. Recall that all recorded modes are detected in 
the burn-in algorithm. In fact, all modes, including the global mode, were identified within 
lOK iterations for every dataset. This observation confirms the notion that the burn-in 
algorithm alone may serve as a powerful optimization algorithm (Remark We calculated 
the MSE of the vector (log Ai, . . . , log Xr), where K is the number of local modes, and the 
average MSE of Ai, . . . , Ar. When calculating the MSE of the log probability vector, we 
ignored those tiny domains with a probability mass < 10~^. These estimates are seen to be 
very accurate as reported in Table [2j 

We also applied the MD sampler with pmx = (MDq) and the WL algorithm with 
the same parameter setting to these datasets (Table [2]). The degraded performance of 
MDq demonstrates the effectiveness of the mixed jump for sampling DAGs. The WL al- 
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gorithm missed 0.12 modes on average for the second network, and its estimation of the 
DR {(Afc,Afc)} was much less accurate compared to that of the MD sampler. The average 
MSE of Ai, . . . , Ak and the MSE of (log Xk)i:K were more than 10 and 1,000 times greater 
than those of the MD sampler, respectively. The huge MSE of the (logA^) constructed 
by the WL algorithm was often due to severe underestimation of the probability masses of 
domains sampled insufficiently. This result implies that without domain partitioning, the 
WL algorithm is unable to estimate domain-based representations for a BN of a moderately 
complicated structure. Since the number of local modes often increases very fast with the 
complexity of a problem, we re-applied the MD sampler with K* = 10 to investigate the 
effect of keeping only a subset of local modes. Obviously, the algorithm missed a few local 
modes when the total number of modes exceeded K* . But in terms of estimating A and A^, 
the performance of the MD sampler with K* = 10 was very comparable to its performance 
when all the local modes were kept (Table [2|) . The probability mass outside the domains 
of recorded modes, Aq = 1 — J2k=i ^k, is less than 0.007 averaging over the 15 datasets 
where the posterior distributions have more than 10 local modes. This confirms that the 
MD sampler indeed captured major modes in the burn-in period. 

6 Protein-signaling networks 
6.1 Background and data 

The ability of cells to properly respond to environment is the basis of development, tissue 
repair, and immunity. Such response is established via information flow along signaling 
pathways mediated by a series of signaling proteins. Cross-talks and interplay between 
pathways reflect the network nature of the interaction among these signaling molecules. 
Construction of signaling networks is an important step towards a global understanding 
of normal cellular responses to various environmental stimuli and more effective treatment 
of complex diseases caused by malfunction of components in a pathway. Causal Bayesian 
networks may be used for modeling signaling networks as the relation among pathway 
components has a natural causal interpretation. That is, the activation or inhibition of a 
set of upstream molecules in a network causes the state change of downstream molecules. 
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An edge from molecule A to molecule B in a signaling network implies that a change in the 
state of A causes a change in the state of B via a direct biochemical reaction. Here, a state 
change refers to a chemical, physical or locational modification of a molecule. However, as 
there may exist mutual regulation between two signaling molecules, the use of DAGs for 
modeling signaling networks is only a first-step approximation. 

In this study, we construct protein-signaling networks from flow cytometry data. Poly- 
chromatic flow cytometry is a high-throughput technique for probing simultaneously the 
(phosphorylation) states of multiple proteins in a single cell. Since measurements are col- 
lected on a cell-by-cell basis, huge amounts of data can be produced in one experiment. 
Sachs et al. (2005) made flow cytometry measurements of 11 proteins and phospholipids 
in the signaling network of human primary naive CD4"'" T cells under nine different exper- 
imental perturbations that either activate or inhibit a particular molecule or activate the 
entire pathway. Note that a perturbation that activates or inhibits a particular molecule 
is essentially an intervention on the molecule, so that causal structures of the underlying 
network may be inferred. Under each perturbation, 600 cells were collected with 11 mea- 
surements for each. The measurements in the data were discretized into three levels, high, 
medium, and low by Sachs et al. In summary, this dataset contains 5,400 data points for 11 
ternary variables. Since naive T cells are essential for the immune system to continuously 
respond to unfamiliar pathogens, extensive studies have been conducted to establish the 
signaling pathways. An annotated signaling network among the 11 molecules, provided by 
Sachs et al., is depicted as a causal Bayesian network in Figure [3l This network contains 
18 edges that are well-established in the literature and two edges (PKC — )■ PKA and Erk 
—7- Akt) reported from recent experiments independent of the flow cytometry data. 

6.2 Predicted networks 

The MD sampler Wets cipplicd. to this datasGt with. L — 20, — 10, Pmx 

= 0.1, and 

K* = 10. The total number of iterations was 5M, of which the first 50K were used for 
burn-in. We estimated the posterior expected adjacency matrix A and its domain-based 
representation. Three predicted networks were constructed by thresholding posterior edge 
probabilities at c = 0.5,0.7,0.9, i.e., an edge from Zj to Zj was predicted if the edge 
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Figure 3: An annotated protein-signaling network in naive CDA'^ T cells. 

probability ciij > c. For simplicity we call such a predicted network a mean network (with 
a threshold c). Table [3] (top panel) reports the number of true positive edges (TP) that 
are both predicted by the MD sampler and annotated in Figure [3] and the number of false 
positive edges (FP) that are predicted but not annotated, together with the (unnormalized) 
log posterior probability of the identified global maximum DAG. To compare the results, we 
re-applied the MD sampler with the sSjIiib pajranictBrs except th.3-t Pmx — 

(MDo in Table E]) 

and applied the WL algorithm with the same parameters as used in the MD sampler to the 
same data. The average result over 20 independent runs of each method is summarized in 
Table [3l In terms of finding the global mode, the MD sampler was much more effective and 
robust than the other two algorithms, reflected by a much higher average log probability 
and a much smaller standard deviation across multiple runs. The MD sampler with or 
without the mixed jump showed comparable results in predicting network structures, and 
both predicted more true positives and fewer false positives than the WL algorithm did 
for all the three thresholds. We noticed that the mean networks constructed with different 
thresholds (c = 0.5,0.7,0.9) were almost identical. This was due to the fact that the 
posterior edge probabilities were close to either 1 or because of the large data size. The 
network constructed from the same data by the order-graph sampler, reported in Figure 
11 of Ellis and Wong (2008), has 9 true positive and 11 false positive edges, which misses 
much more true edges and includes more false positives than the networks predicted by the 
MD sampler. These results demonstrate that the MD sampler is very powerful in learning 
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Table 3: Results on the flow cytometry data 
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The top and bottom panels report the average results over 20 independent runs on the 
full dataset and the average results over ten test datasets in cross validation, respectively. 
Predictive probabilities (Log pred) are reported as log ratios over the predictive probability 
given the mean network of the MD sampler. 

underlying network structures from experimental data compared to other advanced Monte 
Carlo techniques. 

Next, we focus on the estimation of the DR, {(A^, A^) : /c = 0, . . . , if*}, and its scientific 
implications. A network Gk can be constructed for an attraction domain by thresholding 
Afc, the conditional expected adjacency matrix on the domain D^, for k = 1, . . . , K* . To 
distinguish it from the mean network, we call Gk a local network. We take the result of 
a representative run of the MD sampler {pmx = 0.1) to demonstrate local networks with 
the threshold c = 0.9. The parent sets of eight nodes are identical across the K* = 10 
local networks. We report in Table H] the parents of the other three nodes, PLC, PIP3, 
and Erk, which are distinct among the local networks, together with the probability masses 
(logAfc) of the 10 domains and the probabilities of the local modes [log-P(z^fc | N)]. The 
local networks may predict meaningful alternative edges not included in the mean network, 
as illustrated by the result on a particular pathway, Raf Mek Erk (Figure [3|) . This 
expected pathway was predicted by all the local networks and the mean network. However, 
some local networks also contained a direct link from Raf to Erk (Table S]) . As Mek was 
inhibited in one of the experimental conditions, this finding suggests that the cells may 
have another pathway that passes the signal from Raf to Erk via some indirect regulation 
or via molecules not included in this analysis, when Mek is not functioning properly. Such 
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Table 4: Local networks constructed from domain-based representation 
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compensational mechanisms exist widely in many biological networks. Indeed, Raf has been 
reported to enhance the kinase activity of PKC^, an isoform of PKC, although PKC0 is 
unlikely a direct phosphorylation target of Raf (Hindley and Kolch 2007). As indicated by 
Figure [3l Erk is a downstream node of PKC and thus may be regulated indirectly by Raf 
via the enhanced kinase activity of PKC0. Such novel hypotheses could not be proposed 
if we did not construct the DR for the posterior distribution. Clearly, the DR of network 
structures not only gives a detailed landscape of various local domains but also provides 
new insights into the underlying scientific problem. 

Prom Table |4] we find that the probability mass is dominated by the domain of the 
identified global mode with a log probability of —31757. Consistent with the summary 
in Table [3l the MD sampler almost always reached this global mode for different runs. 
On the contrary, the highest mode detected by the WL algorithm with an average log 
probability of —31938 is even much lower than the lowest mode in Table [H In other words, 
the WL algorithm was inevitably trapped to some local modes with negligible probability 
masses. This again demonstrates the advantage of the MD sampler, particularly the burn-in 
algorithm, in finding global modes. Even when the probability mass of the global mode is 
dominant and other domains occupy only a small fraction of the sample space, without the 
domain-partitioning design the WL algorithm may be trapped to a local mode of a tiny 
probability mass and produce severely biased estimates. 
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6.3 Cross validation 

To check the statistical variabihty and the predictive power of our method, we conducted 
ten-fold cross validation on this dataset. We partitioned randomly the 5,400 data points 
into ten subsets of equal sizes. We used nine subsets as training data to learn a mean 
network and calculated the predictive probability of data points in the other subset (test 
data) given the learned mean network. This procedure was repeated 10 times to test on 
every subset. We verified the average accuracy of the mean networks constructed from the 
10 training datasets with different thresholds. The performance of the MD sampler on the 
training datasets was comparable to its performance on the full dataset, which implies its 
robustness to random sampling of input data. The improvement in accuracy (TP/FP) of 
the MD sampler over the other two methods became more significant, especially compared 
to the WL algorithm (Table [3l bottom panel). The average predictive probability of the test 
datasets given the mean networks constructed from the training datasets by each method 
with c = 0.9 is reported in Table[3][Log pred (mean)], from which we see that the predictive 
power of the networks constructed by the two MD samplers was much higher than that of 
the WL algorithm (> 60 in log probability ratio). In addition, we utilized the estimated 
domain-based representations to calculate the predictive probability of a test data point y 
by 

K* 

P{y I {Gk, k]) = kP(y I Gk), (19) 

fc=0 

where P{y \ Gk) is the marginal likelihood of y given the local network G^. This can be 
regarded as a domain-based approximation to the posterior predictive distribution 

P{y\yobs) = Y.P{G\yobs)P{y\G), 

G 

i.e., we use estimated probability masses and conditional mean networks G^ to approxi- 
mate the posterior predictive probability. The advantage is that there is no need to store a 
large posterior sample of networks but only an estimated DR. Since Equation ()19p captures 
the variability among different domains, it is expected to outperform the mean network in 
prediction. In fact, for each method the predictive probability calculated by (|19p [Table [3l 
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Log pred (DR)] was indeed significantly greater than the predictive probability calculated 
given its mean network, especially for the two MD samplers. 

In real applications, we are interested in predicting results for a new experimental 
condition given observed data from other conditions. Thus, we also performed a nine-fold 
cross validation where a training dataset was composed of cells from eight experimental 
conditions and a test dataset only included cells in the other one condition. We applied the 
MD sampler to construct mean networks and DRs {Gk, Xk}j^i from the training datasets. 
The mean networks with c = 0.9 included, on average, 13.2 true edges with 10.8 false 
edges, which was slightly worse than the result from the ten-fold cross validation. The 
degraded performance is expected as removing all cells from one experimental perturbation 
will increase the uncertainty in determining the directionality of the network. The domain- 
based prediction (|19p was compared against the annotated network G* given in Figure O 
which presumably has the highest predictive power, by evaluating the log likelihood ratio 
(LLR) logi? = log[i-'(y I {Gfc, Afc})/P(y | G*)], where y is a test data point. The average 
logi? over all test data points was —0.062, and thus the predictive probability for a new 
observation given the constructed DR is expected to be higher than 94%(= e^*^-*^^^) of its 
likelihood given the annotated graph. This demonstrates the high predictive power of the 
domain-based prediction constructed by our method. As expected, the average LLR of the 
mean networks over G* was 26% lower than the average of log R. 

7 Discussion 

The central idea of this article is to construct domain-based representations with the MD 
sampler. Related works have been seen in the physics literature under the name of super- 
position approximation. Please see Wales and Bogdan (2006) for a recent review. Given a 
Boltzmann distribution pb{x',t) oc exp[— i7(x)/r], a superposition approach identifies the 
local minima of H{x), i.e., the local modes pb{x;t), and approximates H{x.) on the 
attraction domain of a local minimum by quadratic or high-order functions. The approxi- 
mation is often proposed based on expert knowledge about the physical model under study. 
Expectations with respect to p_b(x; t) are then estimated by summing over approximations 
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from identified domains. The accuracy of this approach largely depends on the employed 
approximation to H{x.) on a domain and thus may not work well for an arbitrary distri- 
bution. The MD sampler differs in that domain-based representations are constructed by 
Monte Carlo sampling which is able to provide accurate estimation with large-size samples; 
no expert knowledge about the target distribution is needed. In addition, our method also 
contains a coherent component for finding local modes, while the superposition approxima- 
tion works more like a two-step approach. 

From a computational perspective, the MD sampler integrates Monte Carlo and deter- 
ministic optimization. A few other methods also have the two ingredients, such as Monte 
Carlo optimization (Li and Scheraga 1987), the basin hopping algorithm (Wales and Doye 
1997), and conjugate gradient Monte Carlo (CGMC) (Liu, Liang, and Wong 2000). In 
Monte Carlo optimization and the basin hopping algorithm, the target distribution p(x) is 
modified to p{x) = p^i^k) for all x € D^, where Dk is the attraction domain of the mode 
Uk- Then a Metropolis- type MCMC is used to sample from p, in which a local optimiza- 
tion algorithm is employed at each iteration to find i5(X*) for the current state X*. These 
methods have been applied to identification of minimum-energy structures of proteins and 
other molecules. However, its application to other fields is limited as the modified density 
p(x) may be improper when the sample space is unbounded. In CGMC, a population of 
samples is evolved and a line sampling step (Gilks, Roberts, and George 1994) is performed 
on a sample along a direction pointing to a local mode found by local optimization initiated 
at another sample. In this way promising proposal may be constructed by borrowing local 
mode information from other samples. A possible future work on the MD sampler is to 
utilize a population of samples. Because local modes are recorded, similar proposals as the 
line sampling can be developed for the MD sampler to further enhance sampling effective- 
ness. Another future direction is to construct disconnectivity graphs (Becker and Karplus 
1997) or trees of sublevel sets (Zhou and Wong 2008) from samples generated by the MD 
sampler. Since samples have been partitioned into domains of attraction, we only need to 
determine the barrier between a pair of domains, defined by max^g^ minxesp(x), where S 
is the collection of all the paths between the two domains. A few candidate approaches 
towards this direction are under current investigation (Zhou 2011). 
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Appendix: Theoretical analysis 

In this Appendix we establish the convergence and ergodicity properties of the MD sampler. 
Our analysis is conducted for a doubly adaptive MCMC, i.e., both the target distribution 
and the proposal may change along the iteration, which includes the MD sampler as a 
special case. Furthermore, the MWL design (Routined]) is employed to adjust 7^. 

Assume that the sample space X is equipped with a countably generated cr-field, B{X). 
Let {r^j}^^^ be a partition of X, where each Afj is nonempty, and B{Xi) = {A € B{X) : A C 
Xi} for i = 1, . . . ,K. Let u = € ^ and cj) G ^ he two vectors of real parameters. 

Denote the product parameter space by = x <I> and write 6 = (a;, 0) G 0. For a; € 0, 
define working density 

K 

^e--'p(x)l(xG^,). (20) 

4 = 1 

Let g(x, •) and t^{x, •), G <I>, be two transition kernels on (X, B{X)). Hereafter, the same 
notation will be used for a kernel and its density with respect to the Lebesgue measure on 
X, e.g., g(x,dy) = g(x,y)dy. For j = 0,1, define Qj- <^(x, •) = (1 - i)g(x, •) +jt^{yL,-). 
Let Kj Q be the MH transition kernel with p^^ as the target distribution and Qj,ct> as the 
proposal, i.e., 

Kj^0{x,dy) = Sj^ei^,dy) + l(x G dy) 

where Sj^0{x,dy) = Qj^^{x,dy)mm[l,p^{y)Qj^^{y,x.)/p^{x.)Qj^^{x,y)], representing an 
accepted move. As Qo^ct> = Q, Kq q and 5*0,0 do not depend on 4> and thus reduce to -fCo.oj and 
Sq,lj, respectively. Furthermore, if we let a; = then Paj(x) = p(x), in which case we simply 
use Kq and Sq. Given a G [0, 1), define a mixture proposal = {l — a)q + at(j,, its accepted 
move Se = (1 — a)S'o,tj + aS'i,e, and the corresponding MH kernel Kq = (1 — a)Ko,tj + o-f^l,0• 
Table[5]summarizes the notations, from left to right, for target distributions, proposals, MH 
kernels, and accepted moves for different scenarios involved in this analysis. 

Denote by Z[u!) the normalizing constant of (I20p . Then Z{u:i) = X]i=i Zi{ijJi), where 
ZiiiOi) = e~^' f-^ p{x.)dx. Let U he a nonempty subset of {1, . . . , k} and Xjj = {J^^jj Xi. 
Given a map g : ^ — )• let fj,^^ij{u) = E'[g(X) | X G Xu] with respect to Pi^. Define a 



1 ~ / Sj^g{x,dz) 
Jx 
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Table 5: Summary of notations 



mixture : p^j Q4, Kg Sg 
j = : q i^o.o) 5*0,0; 
j = 0, a; = : p q Kq Sq 
i = 1 : Pcj K10 Si^e 



map H : 6 X A" — )• 9 by 



H(0,x) = [(l(x G Xi) - (g(x) - 0)l(x G Xu)] 



and the mean field F(0) = ii{0,x)pt^(x)dx, i.e. 



F(0) 



V Z(a;) K 



) 




(/^g,c/(^) - 0) 



Consider the equation F{6) = 0. Let u* = [log Zj(0)]i:K and 4>* = ii^^u{u*). As F(0) is 
invariant to translation of a; by a scalar: cj ^ (cj + /3) ={uJi + /3)i:k, /3 G M, the solution 
set to this equation is {0*(/3) =(0^* + /3, 0*)} H = 0*. Set 71 = 1 and choose an arbitrary 
point (x, 6) G A" X to initialize (X^, 0^). A doubly adaptive MCMC is employed to find a 
solution 9*{P) G 0* and to estimate fih{u:*) = /i(x)poj* (x)(ix for a function h : X ^ M.. 

Algorithm 3 (Doubly adaptive MCMC). Choose a fixed a G [0, 1). For t = 1, . . . ,n: 

1. If ^ 0, set X*+i = i and 0*+^ = 6*; otherwise draw X*+i ~ A^X*,-) and set 



2. Determine 7^+1 by the MWL design in Routine [T] with {Xi\ in place of {Dkj}- 

Denote the L2 norm by | • j and let (i(x, A) = infyg^ |x — y|, where x, y are vectors and 
^ is a set. Our goal is to establish that (i(0",0*) almost surely (with respect to the 
probability measure of the process {X*,0*}) and that {X*} is ergodic. Clearly, translation 
of o^* by a scaler does not change the working density p^^t or affect the convergence of 
0* to 0*. Thus, the theory for Algorithm [3] can be applied to the MD sampler with 
reinitialization (Remark [2]). The update of a;*, up to translation by a scalar, and the 
update of correspond to, respectively, the update of W* ([7]) and the update of V^, ([8]) 
for any k in the MD sampler. We state four conditions for establishing the main results. 



Qt+l =0i+^jH(0*,X*+l). 
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(CI) The sample space X is compact, p(x) > for all x € is bounded, and Q* is 
nonempty. The map g and the function h are p-integrable and bounded. 

(C2) There exist Sg > and > such that |x — y| < 6q implies that g(x, y) > Cq for all 
x,y G ;f. 

(C3) There exist an integer i, 6 > and a probability measure u, such that y{Xi) > for 
z = 1, . . . , K and S^(x, A) > 6u{A), Vx G A" and A G B{X). 

(C4) For all x, y G and all (f) ^, t^{x,y) > and logt(^(x, y) has continuous partial 
derivatives with respect to all the components of cf). 

To avoid mathematical complexity, we assume that X is compact (CI). This assump- 
tion does not lose much generality in practice as we may always restrict the sample space to 
{x : p(x) > ep} given an sufficiently small €p. Due to the compactness of X, any continuous 
map and function on X will be bounded. Conditions (C2, C3) are standard conditions on 
the fixed proposal g(x, y) to guarantee irreducibility and aperiodicity of the MH kernel Kq. 
They are satisfied by all the local moves used in this article. A regularity condition on the 
adaptive proposal is specified in (C4). For the mixed jumps in the examples, is either 
the covariance matrix of a multivariate normal distribution or the cell probability vector of 
a multinomial distribution, and (C4) is satisfied. 

Lemma 1. Let a G [0, 1). For any i, j G {1, . . . , k} and 6 e Q, if e'^'"'^^ > ci G (0, 1] then 
i^0(x, A) > (1 - a)ci5o(x, A), Vx G Xi and A G B{Xj). 

Proof. By definition, /^^(x. A) > (l-a)Ko,t^(x, A) > (l-a)5o,t^(x. A) for every 6 = {u, cf)). 
For any x G and y Xj, 



So,Lj (x, dy) = q{x, dy) min 
> ci(/(x, dy) min 



1, c 



1, 



p(x)Q(x,y) 

p(y)g(y,x; 



ci5o(x, (iy). 



p{x)q{x,y) 

Thus, i^0(x, A)>{1- a)ciSo(x, A) for aU A G B{Xj). □ 
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Theorem 2. //(C1)-(C4) hold, then d{e'^ ,Q*) 



a.s. 



and 



1 



n 



n 




(21) 



Proof. Lemma n and (C3) implies that Vx G Xi, K^{^,Xj) > e^z/(;fj) > if e^'-'^J > a, 
where Ci, > 0. By Theorem 4.2 of Atchade and Liu (2010), 



where v'^ = Y17=i 1(^* S -^i)- This implies that the {74} defined by the MWL update 
(Routine [1]) will decrease below any given > after a finite number of iterations, i.e., 
tc < 00, almost surely. Then, Algorithm [3] becomes a stochastic approximation algorithm 
with a deterministic sequence of {74}. According to Proposition 6.1 and Theorem 5.5 in 
Andrieu et al. (2005), we only need to verify the drift conditions (DRIl-3) and assumptions 
(Al, A4) given in that paper to show the convergence of 0". 

Verifying the drift conditions. Let T> be any compact subset of 0, and Pi and P2 be 
the projections of D into Q and <l>, respectively. Since Vi is compact, there is an ej) G (0, 1] 
such that minjj inf^^gx)! e'^'^'^J > e©. By Lemma [T] and (C3), there is a 5x> > such that 



where i and 1/ are defined in (C3). This gives the minorization condition in (DRIl). Given 
(C2) and that Pcj,u: € Di, is bounded away from and 00 under (CI), /^o,a> is irreducible 
and aperiodic for every a; G Pi, according to Theorem 2.2 of Roberts and Tweedie (1996). 
Consequently, for every G P, Kg is also irreducible and aperiodic as a < 1. Let V^(x) = 1 
for all X G It is then easy to verify other conditions in (DRIl). 

Since both g and © are bounded (CI), there is a C2 > such that for all x G A", 



niaxlimsup \vf — t;"| < 00, a.s. 



*J n— >oo 



inf K^(x, A) > 6vi^{A), Vx G ;f, A G B{X) 



(22) 



sup |H(0,x)j < K + |g(x)| + sup \4>\ < 02 



(23) 




H(0,x) - H(0',x)| <\(f)- < C2\e - G'\,\/e,G' G e. 



(24) 
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These two inequalities imply (DRI2) with V{x.) = 1. 

Condition (DRI3) can be verified by the same argument used in Liang et al. (2007) 
once we find a constant C3 > such that 



de. 



< C3Q0(x,y), 



(25) 



for all x,y X , 6 G T> and all i, where 9i is the ith component of = {u, cf)). Denote by (pj 
the jth component of cf). Straightforward calculation leads to \dSo{x,y) /diOi\ < (^^(x, y) 
and 



d(j)j 



R0{x,y)[dlogt^{y,x)/d4>j]at^{x,y), if i?0(x,y) < 1 
[dlogt^{x,y)/d(j)j] at^{x,y), otherwise, 

where i?0(x,y) = Pt^(y)t0(y,x)/pt^(x)t<^(x,y). Condition (C4) with the compactness of 
P2 and X guarantees that 



sup sup \dlogtcj)(x,y) /dcpjl < 00. 

As at(^(x, y) < (^^(x, y), (i25]l holds and (DRI3) is verified. 

Verifying assumptions (Al, A4). It is assumed in assumption (Al) the existence of a 
global Lyapunov function for F(0). Let L(0) = fJ2i=i(.Zi{uJi) - + - ^^g,u{^)\'^ , 

where Z{uj) = Z{u)/n. Using straightforward algebra one can show that 

-Z(VL, F) = X: c,Z, ■ (AZ,)2 + (A0)"r J2 ^Z^^^^f^ + E ^-l^*^!'' 

i=i ueu ueu 

where AZi = Zi{uJi) — Z{u\ Acf) = cf) — fj,g^u{u:), and the arguments (oji and cj) in Zi{uji) 
and Z{u;) have been dropped. Since is bounded, Zi{uji) > > for all i. Because g is 
p-integrable, /Xg^j = /^g,{j} (0) is bounded for all i and 

^ l/^g,«| +™ax|/jg^j| < 2 max |/^g,i| < 00. 

' iaU ' l<i<K ' 

Thus, choosing a sufficiently large C4 ensures that (VL(0),F(0)) < for any 9 G Q with 
equality if and only if 6 E 0*. Furthermore, {0 G : L{6) < Ci} is compact for some Cl > 
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and the closure of L{Q*) has an empty interior. Thus, all the conditions in assumption 
(Al) are satisfied. Since - 0*| < -ft supe,^ |H(0,x)| < 027* ([23]) and 7* = l/{t + ^) for 
t > tc, verifying assumption (A4) is immediate. This completes the proof of the convergence 
of 0". 

The result (|2ip can be established similarly as the proof of Proposition 6.2 in Atchade 
and Liu (2010). We only give an outline here. The drift conditions imply that for any 
6 & V, there exist hg(-x), C5 > 0, and b € (0, 1] such that hg — Kghg = h — fihi^) and 

SUP0g©(||/i0|| + \\Kghg\\) < 00, 
\\he - hg>\\ + \\Kghg - Kg,hg,\\ < C^IO - e'f>,\je,e' G V, 

where Kgheix) = Kg{x,dy)hg{y) and for / : A' ^ M, ||/|| = sup^gA^ l/WI- See 
Proposition 6.1 and assumption (A3) of Andrieu et al. (2005). Then, following an essen- 
tially identical proof to that of Lemma 6.6 in Atchade and Liu (2010), we can show that 
^^]^ t^^[/i(X*~'^^) — fihi^^)] has a finite limit almost surely. Since fih{uj^) ^—^ fihi^*) as 
t — )• 00, Kronecker's lemma applied to the above infinite sum leads to the desired result. 

□ 
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